FOREWORD



This technical report is the First Annual Report by the
University of Texas at Austin, Linguistics Research Center,
Austin, Texas, under contract F30602-70-C-0118, Job Order
Number 459A0000, for Rome Air Development Center, Griffiss
Air Force Base, New York. It covers the period from 1
February 1970 to 31 January 1971. Sgt. Charles S. Bond, Jr.
(IRDT) is the RADC Project Engineer.

This report has been reviewed by the Information
Office (OI) and is releasable to the National Technical
Information Service (NTIS).

This technical report has been reviewed and is approved.




CHARLES S. BOND, JR., Sgt, USAF
Technical Evaluator 







ABSTRACT 



Research in theoretical linguistics,
descriptive linguistics, lexicography, 
and systems design pertinent to the Lin- 
guistics Research System for mechanical 
translation performed at the Linguistics 
Research Center is described. Work in 
the theoretical group concentrated on 
intra-sentential disambiguation and on
improving certain parts of the system to 
achieve greater economy in processing. 
The linguistic group was engaged in cor- 
recting and updating the existing German 
and English lexical data bases by as- 
signing syntactic and semantic selection 
restrictions to lexical items. Work in 
the systems group concentrated on the 
reduction of the size of the existing 
LRS lexical data base without infor- 
mation loss, on the conversion of this 
data base to the LRS subscript format, 
on the construction of supporting pro- 
grams to expedite and facilitate the up- 
dating of the LRS word lists, and on 
the construction of part of the LRS
grammar maintenance and systems programs.







TABLE OF CONTENTS



INTRODUCTION

I. THEORY: The Linguistics Research System

   1.1  Canonical Forms
   1.2  Normal Forms
   1.3  Mechanical Translation
   1.4  Subscript Grammars
   1.5  Syntactic Grammars
   1.6  Normal Form Grammars
   1.7  Analysis Procedure
   1.8  Intra-Sentential Disambiguation
   1.9  Changes in Subscript Grammar Format and Storage
   1.10 Example for Intra-Sentential Disambiguation

II. LEXICOGRAPHY

   2.1  Existing Data
   2.2  Progress
   2.3  Development of a General Classification System

III. PROGRAMMING

   3.1  Grammar Conversion Programs
   3.2  Systems Programs
   3.3  Supporting Programs
   3.4  Program Descriptions

CONCLUSION

References






INTRODUCTION 



The difficulties that confront attempts to mechanically
recognize and produce sentences in natural language generally
arise from two causes. One is the lack of a lexicon with precise
information on the syntactic and semantic properties of the
context in which lexical items may occur. The other source
of difficulty is the concomitant generality of the recognition
grammars, which is necessary in order to keep the number of
required rules to a manageable size. As a result of this
generality, sentences are assigned a vast number of readings
("forced ambiguities") in addition to their legitimate readings.

These difficulties did not change with the advent of
transformational recognizers [7, 9, 10, 14] in 1964. Due to the
lack of comprehensive grammars and a complete set of
transformational rules, these recognizers cannot be used for the
analysis of arbitrary sentences in natural language (cf. [8]). It
may also be significant that the advances in the theory of
transformational grammar (the incorporation of a lexical component
with semo-syntactic and semantic features, the introduction of
output constraints, derivational constraints, and transderivational
constraints) have not been incorporated into transformational
recognition procedures.

The dissatisfaction with transformational grammar as
described in Aspects of the Theory of Syntax [2] has led in the
meantime to a schism among generative linguists, with the uni- 
versal base hypothesis opposing an "extended" standard trans- 
formational grammar. Moreover, the general disaffection for 
the concept of a grammar as a device which generates individual 
sentences can be observed from various attempts to tackle the 
problem of producing or recognizing sentences in discourse by 
positing so-called text grammars. 

We feel that the difficulties in the production of
transformational grammars for language are mainly due to the
unnecessary complexity of the transformational apparatus. A
transformational grammar was, originally, a device which generates
all and only the grammatical sentences of a (surface) language.
The grammar supposedly generated deep structures from which, by
means of transformations, well-formed surface strings were
derived.

The advent of Aspects with its lexical component and
embedding of sentences increased the power of the phrase-structure
component; it was now able to generate well-formed and ill-formed
sentences. The transformational component obtained an additional
function, the "filtering function," whose purpose was to delete
all output strings which were not well-formed. These could be
recognized from the occurrence of "non-surface" terminals: dummy
symbols and internal sentence boundary marks. That this filtering
function did not suffice to eliminate all ill-formed strings has
been shown. And, so far, the additional conditions stated above
have had to be introduced in order to guarantee the
well-formedness of the output string.

With this in mind the question naturally arises— Why not 
guarantee the well-formedness of the output string by means of 
an output grammar? It is certainly interesting, if not
significant, that the centers which have made the most important
advances in the analysis of sentences in natural language (the
Transformation and Discourse Analysis Project at the University 
of Pennsylvania, and the CETA group in Grenoble) operate with a 
transformational apparatus but with a surface grammar. The 
addition of a surface grammar component has an obvious advantage 
in that the linguist is able to describe the strings of language 
in a manner which has been long familiar to him and which lin- 
guistic tradition has used for centuries. The transformational 
component can then be considerably simplified. In particular, 
the ordering of transformations which had originally been 
necessary to guarantee well-formedness of the output string can 
now be taken over by the surface grammar. 

In conclusion — past experience has clearly demonstrated 
that, due to the large number of rules required, surface analysis 
by means of context-free grammars with simple symbols cannot be 
performed. Further, a grammar appropriate for surface analysis 
must permit the linguist to express the linguistic
generalizations that he has been accustomed to make: that a sequence
of constituents forms a constituent only if each constituent has
the syntactic and semantic properties required for the well- 
formedness of the string. 



In the remainder of this report we give a general outline
of the Linguistics Research System (LRS), a grammatical model
for the mechanical recognition and production of sentences in
natural language used for machine translation purposes. A more
comprehensive description is given in [5].

During this contract period, the theoretical group at LRC
concentrated on disambiguation of sentences and on improving
certain parts of the system to achieve greater economy in
processing. Detailed descriptions can be found in [6]. The
linguistic group was engaged in correcting and updating the
existing German and English lexical data bases by assigning
syntactic and semantic selection restrictions to lexical items.
The systems group was concerned with a) reducing the size of
existing LRS dictionaries without loss of information while
converting them to the new LRS format; b) constructing supporting
programs to expedite and facilitate the updating of the LRS
lexical data bases; and c) constructing a part of the LRS grammar
maintenance and LRS systems programs. Detailed descriptions of
these activities can be found in Sections II, III, and IV.
SECTION I



THEORY 



THE LINGUISTICS RESEARCH SYSTEM 



The purpose of the Linguistics Research System (LRS), which
is being constructed under this contract at the Linguistics
Research Center of the University of Texas at Austin, is to provide
a description and explanation of human linguistic capabilities by
performing recognition and production of sentences in natural
language, in order to achieve mechanical translation. The LRS is
a system of components which can be connected like building
blocks to form larger configurations. Each component consists of
a set of algorithms and of instructions which are executed by the
algorithms and which modify the general operations of the
algorithms in a prescribed way. Such instructions are linguistic
rules of various kinds: dictionary rules, syntactic rules,
interpretation rules, transformation rules, mapping rules,
selection rules, rejection rules, and others.

The LRS is based on the following linguistic assumptions:

1) that grammatical relations can be more easily and
correctly stated for so-called standard strings than for
surface strings;

2) that surface information is necessary for correct
semantic interpretation;

3) that synonymous sentences can be reduced to the
same "universal" representation.

In its basic configuration the LRS is a grammatical model
for the recognition and production of synonymous surface
sentences with identical or different deep structures. By deep
structures we mean the stage of a sentence derivation in standard
transformational grammar when all base component rules
(constituent and feature rewriting rules) have applied but before
lexical insertions have been performed.



1.1 Canonical Forms

The purpose of this model is to associate with each sentence
in a natural language all its semantic readings or canonical
forms (KF), and to derive from a given KF t all sentences with
the semantic reading t. A sentence which has n distinct semantic
readings has n distinct KF's. Two different sentences t and u
which have one semantic reading in common have one KF in common.
Sentences of different languages which are translations of one
another have at least one KF in common.

A canonical form consists of a sequence of connected simple
KF expressions. K, the language of KF's, has the following
properties:

a) Each simple KF expression is a primitive element of
K (i.e., it has one and only one [atomic] semantic
interpretation). If a surface terminal q has n different senses,
then n different KF expressions (simple or connected)
represent the different senses of q.

b) No two different KF expressions p and q are
synonymous. If two surface terminals have one sense in common,
then that reading is represented by the same KF expression.



1.2 Normal Forms

Because of the difficulties involved in the construction of
KF's, LRS represents the meaning of sentences by means of normal
forms (NF).

The NF's of a language are distinct from the KF's in that
NF lexical primitives may represent either atomic (simple) or
molecular (connected) KF expressions. Thus the NF primitive
bachelor corresponds to the connection of the four simple KF
expressions unmarried, human, adult, male.
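
The relation between an NF primitive and its KF expressions can be pictured as a simple lookup. The sketch below is only an illustration: the table name NF_TO_KF, the function is_molecular, and the atomic entry "human" are assumptions made for the example, and only the decomposition of bachelor is taken from the text.

```python
# Hypothetical table: each NF primitive stands for one or more connected
# simple KF expressions. Only the "bachelor" entry comes from the text;
# the "human" entry and all names are illustrative assumptions.
NF_TO_KF = {
    "bachelor": frozenset({"unmarried", "human", "adult", "male"}),
    "human": frozenset({"human"}),
}


def is_molecular(nf_primitive):
    """True if the NF primitive corresponds to connected KF expressions."""
    return len(NF_TO_KF[nf_primitive]) > 1


print(is_molecular("bachelor"))
print(is_molecular("human"))
```

An atomic NF primitive coincides with a single simple KF expression, while a molecular one abbreviates several of them.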



1.3 Mechanical Translation

The process of deriving from a surface sentence t all the
NF's of t is performed by the following components:

the surface component

the standard component

the normal form component.

The surface component assigns to each surface sentence t all
its syntactic readings according to the surface grammar, and
derives from those a tentative standard string by means of the
transformation instructions contained in the rules which apply
to t. Tentative standard strings consist of complex standard
terminal symbols. These are surface terminals with their
(possibly disambiguated) dictionary interpretation, and dummy
symbols which were introduced by the transformations. Dummy
symbols represent grammatical morphemes and elided lexical items.
Elements which were discontinuous in the surface are contiguous
in the tentative standard strings.

The standard component then analyzes these strings with the
standard grammar, which assigns a standard description to all
well-formed strings and filters out all ill-formed strings.

The NF component finally interprets the readings of the re- 
maining standard strings by means of the NF grammar which assigns 
NF expressions to individual or connected standard subtrees. 

Production, the reversal of the recognition process, is also 
performed in three steps. 

a) The normal form component, by means of the NF grammar
of the output language, derives from the NF reading of
the input sentence, which is identical to the NF reading of
the output language, all the associated tentative standard
readings of the output string t.

b) The standard component, by means of the conditions
and operations stated in the standard grammar rules of the
output language, selects all well-formed standard readings
from the tentative standard readings and filters out all
ill-formed readings. The standard component then associates
with each standard reading the corresponding tentative
surface strings.

c) The surface component— by means of the rearrange- 
ment grammar of the output language — then assigns a surface 
description to all well-formed surface strings and filters 
out all ill-formed surface strings, i.e., those which are 
either not accepted or do not meet the output conditions of 
the rearrangement grammar. The transformation instructions 
associated with the rearrangement rules finally delete the 
standard dummy symbols, reintroduce lexical pieces which 
had been deleted after surface analysis, and rearrange the 
remaining terminals in surface word order. 



1.4 Subscript Grammars

Four grammars (surface, standard, normal form, and
rearrangement) exist for each language. The non-terminal and
terminal vocabulary symbols of each grammar are complex symbols,
except for the terminal symbols of the surface grammar. Each
complex symbol consists of a category symbol and zero or more
subscript or feature symbols; each subscript may have zero or more
values.

The grammar rules used during the recognition and production
of sentences (both of which are performed as a bottom-to-top
direct substitution analysis) are generated by the processing
algorithms by means of instructions represented as context-free
rule schemata. A rule schema successfully analyzes a string of
vocabulary symbols if each rule constituent is non-distinct from
the symbol it analyzes, and if all the relations stated between
rule constituents hold for the corresponding symbols.

If a rule schema is successfully applied, a new vocabulary symbol
is constructed according to the instructions stated in the
antecedent of the rule schema.



The conditions that may be stated for individual constituents
in a rule consequent are:

a) A particular category symbol either may not or must
contain a particular subscript or combination of subscripts;

b) A particular subscript symbol may not or must
contain a particular value or combination of values;

c) Operations between subscripts of different
constituents may not or must be successful. (These operations,
the set-theoretical operations Intersection, Sum, and
Difference, are performed with the values of the specified
subscripts.)
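
A complex symbol and the three set-theoretical operations can be sketched in Python as follows. This is a minimal illustration, not the LRS implementation: the class ComplexSymbol, the function names, and in particular the reading of "successful" as "yields a non-empty value set" are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ComplexSymbol:
    """A category symbol with zero or more subscripts, each holding a value set."""
    category: str
    subscripts: dict = field(default_factory=dict)  # subscript name -> set of values

    def values(self, name):
        return set(self.subscripts.get(name, ()))


def intersection(a, b, name):
    return a.values(name) & b.values(name)


def set_sum(a, b, name):
    return a.values(name) | b.values(name)


def difference(a, b, name):
    return a.values(name) - b.values(name)


def succeeds(result):
    # Assumption: an operation counts as successful when its result is non-empty.
    return bool(result)


det = ComplexSymbol("DET", {"NU": {"S", "P"}})
noun = ComplexSymbol("N", {"NU": {"S"}})
agreed_number = intersection(det, noun, "NU")
print(agreed_number, succeeds(agreed_number))
```

Under these assumptions, number agreement between a determiner and a noun is simply an intersection of their NU value sets that must not come out empty.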

The advantages of a subscript grammar are numerous. It permits
the expression of relations such as agreement and government
in accordance with the intuition of the human speaker. Similarly,
grammatical, semantic, and stylistic categories can be
conveniently expressed.



1.5 Syntactic Grammars

Each rule schema of each grammar consists of a syntactic
part and an optional transformational part. For surface and
standard grammar, the syntactic part of each rule schema consists
of rewrite rules. The transformational part contains
only transformations whose structural description is satisfied by
a string of symbols interpreted by the constituents of the rule
schema consequent. The transformations possible in surface and
standard grammars are permutations, deletions, and insertions.
The transformations are "feature sensitive"; in particular it is
possible to lexicalize features of a constituent and to
"featurize" terminal or non-terminal constituents. Thus, words
like up, which form a unit with some verbs, e.g., look something
up, can be assigned as a feature to the head of the verbal
construction, resulting in look something with the feature +up.

1.6 Normal Form Grammars

The rules of the Normal Form Grammar differ from surface and
standard rules in two respects:

a) They apply to connected trees;

b) They are not rewrite rules.

An NF rule applies to all trees (terminal, non-terminal, or
combinations of them) whose nodes, labeled by complex symbols, are
non-distinct from the complex symbols in the consequent of the
NF rule. The antecedent of the NF rule assigns a particular
semantic reading, an NF expression represented by that
antecedent, to all trees to which it applies. Since NF expressions
apply to trees whose nodes are labeled by complex symbols, it is
possible to assign a particular NF reading to a terminal k with
a particular part-of-speech interpretation and with a particular
selection restriction. At the same time, all trees t1, t2, ..., tn
interpreted by the same NF expression k are substitutable for one
another, regardless of whether the root and end nodes of tree ti
are identical or different from those of tree tj.

It is thus possible to define synonymy relations between
words of different part-of-speech and between different syntactic
structures and terminal structures (e.g., lexical units and
idiomatic expressions; lexical units and phrasal expressions; and
lexical units which have an internal variable slot), without
affecting their transformational possibilities. Examples of such
paraphrases can be found in [5], pp. T217-68.



1.7 Analysis Procedure

The recognition and production of strings is performed as a
bottom-to-top analysis. We believe that analysis procedures like
those of Earley [3] or those based on state-transition diagrams
[1, 9, 14] do not operate as efficiently with LRS grammars due to
the complexity of their symbols and the large number of
permutations of constituents typical of highly inflected languages
such as German.

We selected bottom-to-top analysis for the reasons which
follow.

a) It permits an easier treatment of ill-formed strings
(k-strings) within well-formed strings, which occur frequently
in translations, e.g., formulas, foreign names, foreign
citations, etc.

b) It permits the adding of new syntactic or
semo-syntactic values to the lexicon without a concurrent change of
the non-terminal grammar rules. Assume, for example, that
one discovers a sub-class of adjectives which modify only a
certain type of nouns. The addition of the new semantic
feature under the subscript "type" only requires changing
the dictionary rules for the nouns and adjectives affected.
None of the word formation rules or syntactic rules will
need to be changed. This advantage would be lost in a
top-to-bottom analysis where, in addition to the dictionary
rules, the tables for the subscript "type" for nouns and for
adjectives would have to be changed.

c) Finally, tree structures which interpret ambiguous
strings can be conflated to a single tree structure if all
labels of the tree nodes have the same category symbol.
The number of intermediate analyses, similar to state
transition diagrams, is thus considerably reduced. A similar
conflation occurs in the representation of the normal forms
of sentences which contain semantically ambiguous items.

The economy of this analysis procedure was further increased 
by the introduction of: 

a) left-context-sensitive dictionary analysis (cf.
3.4.1);

b) intermediate choice algorithms which — based on 
well-formedness conditions — destroy all inappropriate 
readings after dictionary and word analysis; and, 

c) context-sensitive rejection rules, which apply
during word analysis and whose instructions are executed
during word choice. Word choice tags all those nodes on
which no syntactic rule may build within the analyzed text.



1.8 Intra-Sentential Disambiguation

The most powerful feature is the system's capability of
performing semantic disambiguation of lexical items in context after
sentence analysis without having to trace down the tree branches
from the node S. This is made possible by means of trace
operators which are associated with the disambiguating values of
ambiguous lexical items. These operators cause the system to
remember the location of these lexical items and to disambiguate
them only if a disambiguating environment is given.



1.9 Changes in Subscript Grammar Format and Storage

Certain modifications in the format of writing and storing
subscript rules were made during the reporting year. The most
significant are:

1) the now-permissible separation of condition and
operation statements in subscript rules, and

2) the method of storing the grammar for actual
analysis.
1.9.1 New Format for Operation Statements

The encoding scheme below was introduced in order to
eliminate the ambiguity resulting from two or more linked
operations. For example, consider the rule:

      (1)        (2)         (3)         (4)
C 5   V NP   =   V DET       V A         V N
                 - 3.1GD     . 4.1GD     $ GD

(The digits in parentheses identify the rule fields.) Under the old
convention it is not obvious whether the operation in field 2
(i.e., - 3.1GD) means: perform the difference operation between

a) the value set of the workspace subscript GD for DET and
   the value set of the workspace subscript GD for A,
or
b) the value set of the workspace subscript GD for DET and
   the value set resulting from the intersection indicated
   at 3.1.

In the first, a), the operations at 2.1 and 3.1 are disjoint
and can be done in any order. In the second, b), the operation
stated in 3.1 must be done first.
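
That the two readings can yield different results is easy to see with ordinary set operations. The following Python sketch uses placeholder values X, Y, Z for the subscript GD; these values and the variable names are assumptions chosen only to make the difference visible.

```python
# Illustrative value sets for the subscript GD; the values X, Y, Z are
# placeholders, not values from the LRS grammars.
det_gd = {"X", "Y", "Z"}
a_gd = {"X", "Y"}
n_gd = {"Y"}

# The intersection stated at 3.1 (GD of A with GD of N).
intersection_3_1 = a_gd & n_gd

# Reading a): subtract the original value set of A from that of DET.
reading_a = det_gd - a_gd

# Reading b): subtract the result of the intersection, which must be
# computed first.
reading_b = det_gd - intersection_3_1

print(sorted(reading_a), sorted(reading_b))
```

With these sample values the two readings leave different value sets for DET, which is why the encoding has to say which one is meant.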

The operation statement for a subscript may now be separated
from its condition statement. Rule 12, which was originally
encoded as

C 12   V NO   =   V A         V N
                  . 3.1GD     $ GD
or

C 12   V NO   =   V A         V N
                  $ GD        . 2.1GD

may now be encoded as

C 12   V NO   =   V A         V N
                  $ GD        $ GD
                  . 2.1,3.1

The statement ". 2.1,3.1" represents "perform an intersection
between the value sets of the subscript names enumerated at 2.1
and 3.1", i.e., GD of A and N.

Since the system treats a separated operation statement as
if it were also a subscript, sequences of linked operations can
be stated in a straightforward manner. Thus, for example,
reading b) in Rule 5, above, can now be represented as

C 5   V NP   =   V DET         V A          V N
                 $ GD          $ GD         $ GD
                 - 2.1,3.2     . 3.1,4.1

whereas reading a) is represented as

C 5   V NP   =   V DET         V A          V N
                 $ GD          $ GD         $ GD
                 - 2.1,3.1     . 3.1,4.1



No condition is imposed on the position where separated
operation statements may occur. It is thus possible to place them
in the most advantageous position from a processing point of
view, i.e., the left-most constituent in a rule consequent, as
in the following encoding of reading b):

C 5   V NP   =   V DET         V A          V N
                 $ GD          $ GD         $ GD
                 . 3.1,4.1
                 - 2.1,2.2



1.9.2 Storage of Analysis Grammars

The manner in which the word and syntax grammars are stored
has a great influence on the speed of analysis. After
investigating how the word and syntax algorithms would operate, a
storage structure using a reverse columnar approach was chosen.
The grammar in question is stored by columns, the first column
containing all the unique last terms of rule consequents. The
succeeding columns contain the penultimate terms, the
antepenultimate terms, etc. Associated with each term is a list of rule
numbers in which it occurs. Each terminal, i.e., the left-most
term of a rule consequent, is marked and has a pointer to its
antecedent term.

The analysis programs construct the actual rule by means of
the analyzed individual terms and their associated rule numbers.
Since each unique nth rule term is stored only once, the method
of storing the grammar as described above should facilitate the
analysis as well as use a minimum of storage. As the grammar
might exceed available core memory space, the storage method also
ensures that most or all of the grammar that the analysis program
needs at one time is kept in memory. If last terms are being
analyzed, for instance, all the last terms can be in memory; if
penultimate terms are being analyzed, all the penultimate terms
can be in memory, etc. We anticipate that this method of
storing the grammar will result in a considerable increase in
processing speed.
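
The reverse columnar idea can be sketched in Python as below. The toy grammar, the function names build_columns and match_suffix, and the use of the rule length in place of the "marked" left-most term are all assumptions made for this illustration; the report does not give the actual storage layout.

```python
from collections import defaultdict

# Hypothetical toy grammar: rule number -> (antecedent, consequent terms).
RULES = {
    5: ("NP", ["DET", "N"]),
    8: ("S", ["NP", "VP"]),
    9: ("NP", ["DET", "A", "N"]),
}


def build_columns(rules):
    """Column 0 holds the last terms, column 1 the penultimate terms, etc.
    Each column maps a unique term to the numbers of the rules it occurs in."""
    columns = []
    for number, (_antecedent, consequent) in rules.items():
        for depth, term in enumerate(reversed(consequent)):
            if depth == len(columns):
                columns.append(defaultdict(set))
            columns[depth][term].add(number)
    return columns


def match_suffix(columns, rules, symbols):
    """Return the rules whose whole consequent matches the end of `symbols`."""
    matched, alive = [], None
    for depth in range(min(len(columns), len(symbols))):
        here = columns[depth].get(symbols[-1 - depth], set())
        alive = here if alive is None else alive & here
        if not alive:
            break
        # Assumption: rule length stands in for the "marked" left-most term.
        matched += [n for n in sorted(alive) if len(rules[n][1]) == depth + 1]
    return matched


COLUMNS = build_columns(RULES)
print(match_suffix(COLUMNS, RULES, ["DET", "A", "N"]))
```

Because matching proceeds column by column from the last term backwards, only one column needs to be consulted at each step, which is the property the storage method relies on.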



1.10 Example for Intra-Sentential Disambiguation

The capabilities of LRS for performing intra-sentence
disambiguation may be shown by the analysis and standardization of
the English sentence

The page slept.

In this sentence, the noun page is ambiguous; one of its semantic
readings is the reading BOY, another the reading PAGINA. This
ambiguity is represented in the dictionary rule 2 below, which
applies to the noun page, by the subscript TY (for type) with
the values HU and IN for HUman and INanimate. This ambiguity
is resolved in the context of the verb slept, which requires an
animate subject, indicated by the subscript TS with the value
AN in rule 6. During the analysis of this sentence, the rule
schemata apply in the order indicated by their numbers.



English dictionary and grammar rules

 1   V DET             =   * THE
     + NU(S,P)

 2   V N               =   * PAGE
     + TY(HU,IN)
     + CL(05)
     T 1.1

 3   V N               =   V N
     + 2.1(X+AN+PO)        $ TY(*AN+*PO+HU)
     A 2

 4   V N               =   V N
     + NU(S)               $ CL(...)     (class values illegible in the source)
     A 2

 5   V NP              =   V DET         V N
     + PS(3)               . 3.1NU       $ NU
     $* 2.1
     A 3
     CHOICE
     S m                   (m = 2-1)

 6   V V               =   * SLEPT
     + CL(15)
     + OB(0)
     + TS(AN)

 6   V V               =   * SLEEP
     + CL(07)
     + OB(0)
     + TS(AN)

 7   V VP              =   V V
     + TN(PA)              $ CL(...15)
     + PS(1,2,3)
     + NU(S,P)
     A 2

 8   V S               =   V NP          V VP          # 0 AUX 0 #
     $ 3.3                 . 3.1NU       $ NU          $ 3.5
     $* 2.3                . 3.2PS       $ PS          $ 3.6
                           . 3.4TY       $ OB(0)       $* 2.1
                                         $ TS          $* 2.2
                                         $ TN
                                         $ VC(A)
     CHOICE
     A 1.1(3.3)
     A 1.2(2.3,3.4)
     S n                   (n = 2-4-1-3-5)



Rule 1 assigns the word "the" the interpretation DETerminer and
states that its NUmber is Singular or Plural.

Rule 2 assigns the word "page" the interpretation Noun of the
paradigmatic CLass 05 and the values HUman and INanimate of the
subscript TY. The subscript T 1.1 indicates that the values in
the first subscript of the first rule term, in this case of TY,
represent semantic ambiguity. The effect of the T operator is
that the address of this subscript, given in brackets in the tree
diagram below, is associated with the subscript TY.

Rule 3 is a redundancy rule which states that all nouns with the
value HUman which have neither the value ANimate nor the value
Physical Object add the values ANimate and Physical Object.
The expression A 2 in the antecedent is an instruction to the
algorithm to carry along all the subscripts of the second
constituent not mentioned in the second rule term.

Rule 4 states that nouns of particular paradigmatic CLasses, if
followed by zero ending, become nouns with the NUmber Singular.
Again, A 2 results in the carrying along of the non-mentioned
subscripts.

Rule 5 states that the sequence of DETerminer and Noun results
in a Noun Phrase, provided that the DETerminer and the Noun
agree in NUmber. The instruction . 3.1NU is to be read as
"intersect the values of the subscript NU with the values of the
first subscript of the constituent matched by the third rule
term." The Noun Phrase is assigned the feature "third person"
and the NUmber in which the two terms agree; the non-mentioned
subscripts of term 3 are carried along.

Rule 6 assigns the word "slept" the reading Verb of paradigmatic
CLass 15. OB(0) stands for "requires zero object." TS(AN)
stands for "the subject must be animate." As we see in the
next rule, allomorphs of a morpheme are assigned the same rule
number. They have in common all subscripts except for the
subscript which indicates paradigmatic CLass.

Rule 7 rewrites all Verbs of CLass 15 as full VerBs in the
PAst TeNse, in the first, second or third PerSon and in the
NUmber Singular or Plural. A 2 results in the carrying along
of all features of the underlying verb.

The syntactic part of rule 8 consists of the first three terms,
which rewrite a Noun Phrase followed by a Verb Phrase as a
Sentence provided the Noun Phrase and the Verb Phrase agree
in NUmber and PerSon and provided that the TYpe of the Noun
Phrase has a value in common with the subscript TS of the Verb
Phrase. In addition, the verb phrase must dominate an
intransitive verb (objects of transitive verbs are dominated by S,
not by VP). These subscripts are artificially associated with
S to permit an easier execution of the rule's choice statement.

The application of these rules to the input sentence results in
the following analysis:

[Tree diagram not reproduced. Its root node is labeled 8 S with
the subscripts OB(0) and TY(AN)[8.1.1]. A note on the diagram
states that "space" occurs as the 4th and 9th text symbols.]



After syntactic analysis, the choice statements in rule 8 are
executed. A 1.1(3.3) reads "take the value of the first
subscript in field 1 and weight it in the address associated with
the third subscript in field 3 if there is such an address."
Thus only the instruction A 1.2(2.3) of A 1.2(2.3,3.4) is
executed. Syntactic choice also introduces the dummy terms of rule 8
and assigns the order 2-4-1-3-5 to the terms and dummy terms in
the rule consequent.



The standardization program then derives the following dis- 
ambiguated string, where the noun page no longer has the features 
"human or inanimate" but only "human," as indicated by the sub- 
script and value TY(HU) below. 



D # C 2 
N 

TY (HU) 
CL(05) 



Cl D AUX 

DET $ PS(3) 

NU(S,P) $ NU(S) 

$ TN(PA) 



C 6 D j$' 

V 

CL(15) 

T0(0) 

TS (AN) 



The noun 



i s 



assigned the interpretation BOV 



by the NF rule 



V BOV = C 2 
D 0 $ TY(HU) 



which we can represent by the graph 




BOV 



ad the input sentence oeen i ^ 

ould have been possible. In that case, the standard represen- 

atlon of the noun page would have been 



C 2 
N 

TY(HU, IN) 
CL(05) 



to which the two NF rules below would have applied, reflecting 
the semantic ambiguity. 



     V BOY = C 2                  V PAGINA = C 2
     D 0     $ TY(HU)     and     D 0        $ TY(IN)



The two resulting German translations would then be: Sie haben den
Knaben and Sie haben die Seite.



SECTION II 






LEXICOGRAPHY 



2.1 Existing Data

The German and English lexicographic data which were available
at the beginning of the reporting period included the German
monolingual machine-processable dictionary, two English
monolingual machine-processable dictionaries, a German verb list,
an English verb list, and German-English past participle and
noun lists.



2.1.1 The German Dictionary 

The German dictionary consists of approximately 40,000
entries. Since stem variants of nouns, adjectives, and verbs
constitute separate entries, these 40,000 dictionary entries
represent approximately 35,000 German word stems. Each entry
is classified as belonging to one of the following categories:
noun, adjective, verb, adverb, determiner, pronoun, preposition,
conjunction, or separable verbal prefix. In addition to these
categories, paradigmatic features are assigned to nouns,
adjectives, and verbs. Nouns have a feature, "gender," which
identifies them as masculine, feminine, neuter, or (in the case
of pluralia tantum nouns) plural. Adverbs which may be used to
modify nouns are marked with respect to their position relative
to the modified noun phrase: preposed (e.g., sogar die Roemer)
or postposed (e.g., Satz ...).



2.1.2 German Lexical Lists 

In order to expand the LRC machine-processable dictionaries,
lists of German verbs and of past participles commonly used as
adjectives had previously been compiled, and compilation of a
list of German nouns had begun. All information was coded from
the German-English English-German Dictionary by Wildhagen and
Heraucourt since this dictionary contains a comparatively large
amount of explicit syntactic and semo-syntactic information.



2.1.2.1 The German Verb List

The German verb list originally contained approximately
18,400 entries. To this list, all stem variants of irregular
verbs were added automatically, resulting in a total of
approximately 30,000 entries. The entries contain a large amount
of syntactic but only a modicum of semo-syntactic information
(for approximately 12% of all verb entries).

a) Prefixes precede the verb stem. Separable prefixes are
followed by a blank space, inseparable prefixes by a hyphen and
a blank space. Examples:

     AUF STEH       (separable prefix)
     VER- ZWEIFEL   (inseparable prefix)

Note that the infinitive endings are stripped from the stem.

All stem variants of irregular verbs are entered in the
dictionary with identical semo-syntactic and syntactic
information, but with different paradigmatic information.

b) Transitivity. Each verb in the verb list is identified
by a descriptor as transitive (VT), intransitive (VI), or
reflexive (VR).



c) Case government is indicated for all transitive verbs
as genitive (GEN), dative (DAT), or accusative (VT or VR).
Descriptors indicating case government may also contain
information about the semantic type of the object:

     JDN       human, accusative
     JDM       human, dative
     JDS       human, genitive
     ETW       non-human, accusative
     ETW DAT   non-human, dative
     ETW GEN   non-human, genitive
     VR DAT    reflexive, dative
     E-A       reciprocal (einander)
     ES        object must be es



d) Verbs which govern prepositional objects are marked by
the specific preposition(s) they may take. Those prepositions
which govern either dative or accusative are distinguished by
case descriptors:

     AN ACC = an with accusative
     AN DAT = an with dative

Prepositions are followed by descriptors specifying the semantic
type of object required (as in AUF JDN), or by SICH (if the
prepositional object must be reflexive), whenever this
information was recognized in Wildhagen.



e) The semantic type of subject required by a verb (if
indicated) is shown by (P) for human, (T) for animal, or (S) for
inanimate.

f) The auxiliary taken by a verb in perfect tense forms
is indicated as follows:

     takes sein                 = S
     takes either haben or sein = S H
     takes haben                = unmarked
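
The keyword and auxiliary conventions in a) and f) are regular enough to be read mechanically. The sketch below is an illustration only, not one of the Center's programs; the function names and the returned dictionary layout are hypothetical, and it reads the auxiliary descriptor S as sein and an unmarked entry as haben.

```python
def parse_keyword(keyword):
    """Split a verb-list keyword such as 'AUF STEH' or 'VER- ZWEIFEL'.

    A prefix followed by a blank is separable; a prefix followed by a
    hyphen and a blank is inseparable; a bare stem has no prefix.
    """
    parts = keyword.split()
    if len(parts) == 1:
        return {"prefix": None, "separable": None, "stem": parts[0]}
    prefix, stem = parts[0], parts[-1]
    if prefix.endswith("-"):
        return {"prefix": prefix[:-1], "separable": False, "stem": stem}
    return {"prefix": prefix, "separable": True, "stem": stem}


# Auxiliary descriptors: S = sein, S H = either, unmarked = haben.
AUXILIARY = {"S": ("sein",), "S H": ("sein", "haben"), "": ("haben",)}


def perfect_auxiliaries(descriptor):
    """Return the perfect-tense auxiliaries for an auxiliary descriptor."""
    return AUXILIARY[" ".join(descriptor.split())]


print(parse_keyword("AUF STEH"))
print(parse_keyword("VER- ZWEIFEL"))
print(perfect_auxiliaries("S H"))
```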



2.1.2.2 The German-English Noun List

Work on the compilation of a list of German nouns had been
in progress for some time. The information coded includes gender,
number (for pluralia and singularia tantum nouns), case
government (including prepositions) for deverbative nouns, and
English translation equivalents. Whenever information was given
in Wildhagen about the area of discourse to which a particular
translation is restricted, this information was coded with the
proper translations.



2.1.2.3 The German-English Past Participle List

The list of German past participles consists of approximately
1,100 entries. It contains primarily those past participles
which are frequently used as adjectives and whose meaning and
translation cannot be automatically derived from the underlying
verb stem, e.g., aufgebracht or [...], or the English adjective
excited. [Note the difference in meaning between the past
participial and the adjectival usage:

     The electron was excited.   (passive)
     The man was excited.        (active)]

Also included are past participles whose stem does not function
as a verb in modern German, as for example, bestuerzt (aghast)
or betagt (aged). The descriptors coded with these entries
indicate case government (including prepositions), semantic type
of object required, and English translation equivalents.



2.1.3 The English Dictionaries

The English lexicographic data base existing at the beginning
of this reporting period consisted of two monolingual
machine-processable dictionaries: a) the so-called WEBSTER
dictionary, based on Webster's New Collegiate Dictionary, which
contains approximately 77,500 entries (47,300 nouns, 20,100
adjectives, 9,200 verbs, plus adverbs and function words), and
b) the so-called LRMD, which was derived from the Russian Master
Dictionary (RMD) and contains approximately 47,300 entries
(34,000 nouns, 7,800 adjectives, 4,800 verbs, plus adverbs and
function words). Each word stem is assigned to one of the
categories: noun, adjective, verb, adverb, determiner, pronoun,
preposition, or conjunction. In addition, nouns, adjectives, and
verbs are assigned to paradigmatic classes and have a feature
indicating vocalic or consonantal onset. Nouns in the LRMD are
also subclassified as human or non-human. A small set of
adjectives has a subscript identifying them as possible
post-nominal modifiers, e.g., [...].



2.1.4 The English Verb List

The English verb list was compiled from The Advanced Learner's
Dictionary of Current English by Hornby, Gatenby, and Wakefield,
and contained approximately 6,400 entries. The syntactic
information given for list entries included the permissible types
of complementation for each verb: objects (direct and indirect,
either of which may be in the form of a prepositional phrase),
predicative complements, adjectives, adverbials, infinitives
(unmarked and marked by to), present and past participles,
that-clauses, interrogative clauses, gerunds, and combinations
of these.



2.2 Progress

The lexicographic work done during the reporting period
consisted of the revision of the existing lexical lists, the
addition of translation equivalents, and the development of a
general system of syntactic and semo-syntactic features to be
used in the further subclassification of lexical elements.



2.2.1 Purpose 

The purpose of the lexicographic work performed at the Center
is manifold:

a) to make the LRC machine-processable dictionaries as
comprehensive as possible in order to provide for maximum
recognition of lexical elements in input texts and for all
necessary translation equivalents;

b) to prevent ambiguous readings of phrases and sentences
by means of lexical information;

c) to permit the selection (on the basis of lexical
features) of the proper translation equivalent for a lexical item
from two or more translation equivalents;

d) to guarantee production of well-formed sentences only.



The first of these points is obvious. The following examples
may illustrate point b):

     They sent the missile to the moon.

     He monitored the flight to the moon.

For both examples (as for each surface sentence in which a
prepositional phrase immediately follows a noun phrase in
post-predicate position), there are two possible analyses. One
analyzes the adverbial as a post-nominal modifier (represented by
an arrow above the sentence in the examples given). The other
analyzes it as a verb modifier (arrows below the sentences).
Given the necessary syntactic or semo-syntactic information, such
sentences can often be disambiguated. In this instance, the
relevant distinction lies in whether the verb and noun may be
modified by an adverb of direction. (The correct analyses of the
examples above are indicated.)
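
The attachment decision can be sketched as a lookup of one lexical feature on the verb and one on the noun. This is an illustrative sketch under stated assumptions, not the LRS procedure: the two tables and their values are hypothetical (the report's own markings were given only in the lost arrow diagrams), chosen on the assumption that send and flight accept an adverb of direction while monitor and missile do not.

```python
# Hypothetical lexical features: may the item be modified by an adverb
# of direction? These values are assumptions made for illustration.
VERB_TAKES_DIRECTION = {"send": True, "monitor": False}
NOUN_TAKES_DIRECTION = {"missile": False, "flight": True}


def directional_pp_readings(verb, noun):
    """List the analyses a directional phrase after 'verb ... noun' allows."""
    readings = []
    if VERB_TAKES_DIRECTION.get(verb, False):
        readings.append("verb modifier")
    if NOUN_TAKES_DIRECTION.get(noun, False):
        readings.append("post-nominal modifier")
    return readings


print(directional_pp_readings("send", "missile"))
print(directional_pp_readings("monitor", "flight"))
```

When both features are present, both readings are returned and the sentence stays ambiguous at this level; when neither is present, the list is empty and the phrase cannot be analyzed as directional.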



An example for point c), the need for selection of proper
translations based on lexical information, is the German verb
abfuettern. It has two possible English translations: feed if
the object is animate, line if the object is inanimate (more
precisely, articles of clothing). The choice between these two
translation equivalents can easily be made if the distinction
between animate and inanimate objects is made in the verb
dictionary, and if nouns are sub-classified accordingly.
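
A minimal sketch of this selection follows. It is not the Center's translation procedure; the table layout and function name are hypothetical, the verb is spelled abfuettern with a transliterated umlaut, and AN and IN stand for animate and inanimate object types.

```python
# Each translation equivalent carries the object types it requires.
TRANSLATIONS = {
    "abfuettern": [
        ({"AN"}, "feed"),  # animate object
        ({"IN"}, "line"),  # inanimate object (articles of clothing)
    ],
}


def translate_verb(stem, object_types):
    """Return the equivalents whose object restriction the noun satisfies."""
    return [
        english
        for required, english in TRANSLATIONS[stem]
        if required & set(object_types)
    ]


print(translate_verb("abfuettern", ["AN"]))  # prints ['feed']
print(translate_verb("abfuettern", ["IN"]))  # prints ['line']
```

A noun that is not sub-classified, or that carries both values, yields both equivalents, which is exactly the case the sub-classification of nouns is meant to remove.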



Point d), the production of well-formed sentences, is clear:
English verbs require certain syntactic patterns and may not be
used in other patterns. For example, the German sentence

     Sie erklaerten ihm das Problem.

must be translated as

     They explained the problem to him.

but not

     *They explained him the problem.

while the sentence

     Wir gaben diesem Phaenomen einen neuen Namen.

may be translated as

     We gave a new name to this phenomenon.
  or We gave this phenomenon a new name.

The information coded in the various lexical lists will
later be added to the German and English machine-processable
dictionaries of the Center.



2.2.2 Work Done: German 



2.2.2.1 The German-English Noun List

Compilation of the German noun list was continued. For each
German entry we coded English translation equivalents and any
relevant features indicating gender, number (for tantum nouns),
case government (including prepositions) for deverbative nouns,
and area of discourse or stylistic level (e.g., <TECH>, <MED>,
<PHYS>, etc.). This work progressed through the German noun
Exzess, reaching a total of approximately 20,000 German nouns.

2.2.2.2 Revision of the German Verb List

The German verb list was revised in its entirety. This
revision included the following:

a) correction of miscoded or mispunched key-words or
descriptors, and addition of missing entries;

b) addition of case information to all German prepositions
which may be used with either dative or accusative;

c) addition of the descriptors ZI (zu-Infinitiv, i.e.,
marked infinitive) and DASS (that-clause) to those German verb
entries which take one or both of these verb complements;

d) introduction of the symbol + between verb complements
which may be used as double objects with the particular verb.



2.2.2.3 Addition of the Translation Equivalents

The English translation equivalents given in Wildhagen for
German verbs were added to the revised German verb list. In the
process of this work, German verb entries were split into more
than one entry whenever different English translations for a
German verb could be associated with specific groups of German
features.

Examples:

     VER- MESS   VT                = MEASURE (EO LAND)
     VER- MESS   VR                = MEASURE INCORRECTLY
     VER- MESS   VR + ETW GEN ZI   = DARE, VENTURE

as in:

     Sie vermassen diese Gegend.
     They measured this area.

     Dabei hatten sie sich vermessen.
     They have measured incorrectly in this case.

     Sie vermessen diese Vermutung als Tatsachen hinzustellen.
     They dare to represent these assumptions as facts.
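
The effect of splitting an entry can be sketched as a lookup keyed on both the German keyword and its group of descriptors. This is a hypothetical illustration, not the format of the actual list: the record layout and the function name are assumptions, and the infinitive descriptor is written ZI here.

```python
# One record per split entry: keyword, descriptor group, translations.
ENTRIES = [
    ("VER- MESS", "VT", ["MEASURE"]),
    ("VER- MESS", "VR", ["MEASURE INCORRECTLY"]),
    ("VER- MESS", "VR + ETW GEN ZI", ["DARE", "VENTURE"]),
]


def normalize(descriptors):
    """Collapse runs of blanks so that spacing differences do not matter."""
    return " ".join(descriptors.split())


def translations_for(keyword, descriptors):
    """Return the translations of the entry matching keyword and features."""
    wanted = normalize(descriptors)
    for key, features, english in ENTRIES:
        if key == keyword and normalize(features) == wanted:
            return english
    return []


print(translations_for("VER- MESS", "VR"))
print(translations_for("VER- MESS", "VR  +  ETW GEN ZI"))
```

Exact matching on the whole descriptor group is the simplest possible design; a fuller treatment would presumably match the observed sentence frame against each group feature by feature.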

Additional information which was given in Wildhagen for the
purpose of selecting proper translation equivalents was coded
with each English translation to which it pertained. This type
of information consists of:

a) the area of discourse in which a particular translation
would be used (given in the list in angled brackets, e.g.,
<PHYS>, <MED>, etc.); or,

b) selection restrictions in the form of particular
nouns given as sample subjects or objects of the German verb or
of its English translation. These were added to the translations
and were marked as English or German, subject or object, by two
preceding letters: ES (English subject), EO, GS, or GO.

In addition, some English translation equivalents in Wildhagen
are accompanied by syntactic or semo-syntactic information.
Such data was incorporated in the noun list in the form of four
descriptors:

     AP     = a person     (human object)
     AP'S   = a person's   (human possessive pronoun)
     ATH    = a thing      (inanimate object)
     OS     = oneself      (reflexive object)



Finally, verb entries which are used in Wildhagen in a verb
phrase with the German verb lassen (let, have, as in have someone
do something) were marked in this bilingual verb list. This
information will be used in future studies of verb phrases of
this type.



2.2.3 Work Done: English

At the beginning of this contract period, the English verb
list (EVL) consisted of 6,547 entries which had been copied from
The Advanced Learner's Dictionary of Current English by Hornby,
Gatenby, and Wakefield. This is, to our knowledge, the only
dictionary which indicates for verbs the object complement and
adverbial complement environment in which the verb may occur.

Apart from its value as a tool for linguistic analysis, the
EVL was created for two reasons: to guarantee the production of
well-formed English sentences, and to be able to associate with
a particular verb the syntactic pattern in which the verb can be
used with a given meaning.



2.2.3.1 Classification in the Hornby Dictionary

In addition to the classification indicated by the patterns
below, verbs are redundantly marked as transitive or intransitive
if this is applicable. Verbs which require a reflexive object
are marked as VR; modals and auxiliaries, as "anomalous finites".

VERB PATTERNS

P1.  Verb + Direct Object
     He cut his finger.

P2.  Verb + (not) to + Infinitive
     He intended to go.

P3.  Verb + Noun or Pronoun + (not) to + Infinitive
     I told the servant to open the window.

P4.  Verb + Noun or Pronoun + (to be) + Complement
     We proved him (to be) wrong.

P5.  Verb + Noun or Pronoun + Infinitive
     They felt the house shake.

P6.  Verb + Noun or Pronoun + Present Participle
     They left me standing outside.

P7.  Verb + Object + Adjective (object complement)
     The sun keeps us warm.

P8.  Verb + Object + Noun (object complement)
     They named their son Henry.

P9.  Verb + Object + Past Participle
     She had a new dress made.

P10. Verb + Object + Adverbial Adjunct
     Put it here.

P11. Verb + that-Clause
     He explained that nothing could be done.

P12. Verb + Noun or Pronoun + that-Clause
     We satisfied ourselves that the plan would work.

P13. Verb + Interrogative Adverb (except why) + to + Infinitive
     He is learning how to swim.

P14. Verb + Noun or Pronoun + Interrogative Adverb (except why)
     + to + Infinitive
     The patterns show you how to make sentences.

P15. Verb + Interrogative Adverb + Clause
     I don't mind where we go.

P16. Verb + Noun or Pronoun + Interrogative Adverb + Clause
     They asked us when we would be back.

P17. Verb + Gerund

     Group A - replacing the gerund with an infinitive results
     in a change of meaning.
     We stopped talking.
     We stopped to talk.

     Group B - the gerund may be replaced by an infinitive
     without a change of meaning.
     He began talking.
     He began to talk.

     Group C - the gerund is equivalent to a passive infinitive.
     That needs explaining.
     That needs to be explained.

P18. Verb + Direct Object + Indirect Object

     Group A - the indirect object is preceded by the
     preposition to and may occur without a preposition before
     the direct object.
     Throw that ball to me.
     Throw me that ball.






     Group B - the indirect object is preceded by the
     preposition for and may occur without a preposition before
     the direct object.
     Have you left any for your sister?
     Have you left your sister any?

     Group C - covers all direct object + indirect object
     constructions other than those stated in Groups A and B.
     I explained the difficulty to him.

P19. Verb + Indirect Object + Direct Object

     Group A - are those verbs which can be used with the
     preposition to in Pattern 18A.
     He handed me the book.
     He handed the book to me.

     Group B - are those verbs which can be used with the
     preposition for in Pattern 18B.
     Buy me one.
     Buy one for me.

     Group C - are those verbs which are rarely or never used
     in Pattern 18.
     I struck him a heavy blow.

P20. Verb + (for) + Complement of duration, distance, price
     or weight
     The rain lasted (for) a whole week.
     It cost ten dollars.

P21. Verb alone
     These are intransitive verbs. Some verbs which are
     normally used with an object may also be used in this
     pattern, the object being understood.
     Fire burns.
     The moon rose.

P22. Verb + Predicative Complement
     This is a boat.



P23. Verb + Adverbial Adjunct
     We must turn back.

P24. Verb + Preposition + Object
     The verb and preposition combine to form a new transitive
     verb followed by an object which can be a noun, pronoun,
     gerund, phrase or clause.
     Look at the blackboard.
     He called on me.

P25. Verb + to + Infinitive

     Group A - the infinitive is one of purpose or aim.
     I went to buy some books.

     Group B - the infinitive indicates result or outcome.
     How can I get to know her?

     Group C - the infinitive is equivalent to a co-ordinate
     clause.
     He awoke to find the house on fire.

     Group D - the infinitive is the main verb.
     I chanced to meet him in the park.

     Group E - the infinitive is used after finites of be for
     a variety of meanings.
     Nobody is to know.
     This I was to learn later.

     Group F - contains as the only member the verb going to.
     He is going to walk home.

2.2.3.2 Frequency of Patterns in EVL, 1969

The entries in EVL were subjected to a glossary run. The
results are represented in Table I, which follows.

TABLE I: Frequency of Patterns in EVL

     Pattern     Frequency in EVL

     P1              4248
     P2                80
     P3               120
     P4                59
     P5                14
     P6                17
     P7                96
     P8                24
     P9                 7
     P10             1670
     P10B               1
     P11              181
     P12               26
     P13               42
     P14                8
     P15               61
     P16               10
     P17               14
     P17A              55
     P17B              16
     P17C               3
     P18              910
     P18A              17
     P18B              80
     P18C               8
     P19               76
     P19A              16
     P19B               9
     P19C              10
     P20               78
     P21             2074
     P22               45
     P22D               1
     P23             1372
     P24             1121
     P24A               1
     P24B               1
     P25              139
     P25A               1
     P25B               1
     P29*             312

     *P29 refers to verbs which were not classified in the
      Hornby Dictionary.

2.2.3.3 Subsequent Work

The purpose of the work performed during the first year of
the contract period was to improve EVL by making the original
classification scheme more precise, and to add to it the same
semo-syntactic selection restrictions as those of the German
verb list.



Thus, two of the Hornby verb patterns, P10 and P23, were
redefined. Pattern 10, for which Hornby gives as examples

     He brought his brother to me.
     They treat their sister as if she were only a servant.

was restricted to

     Verb + Object + Movable Adverbial Particle

     He took off his hat.
     He took his hat off.

Similarly, Pattern 23 was defined as

     Verb + Adverbial Particle

     Get up.
     Sit down.

The actual updating of EVL involved:

a) the addition of the adverbial particles with
which each verb in the new P10 and P23 could occur;

b) the addition of the preposition(s) which each
verb in P24 (Verb + Prepositional Object) required;

c) the subclassification of the verbs in the general
classes P17, P18, and P19 into the corresponding subclasses A,
B, and C shown above in 2.2.3.1;

d) the specific classification of all verbs which
had, as a stop-gap measure, been assembled under P29; and

e) the addition of the descriptors H, N, M, K, I
(for: human, non-human animate, non-animate, non-animate
concrete, non-animate abstract, respectively) to all patterns in
which a noun phrase object complement occurred.
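
An updated entry can be pictured as a record holding the pattern, the particles added in step a), and the object descriptors added in step e). The sketch below is hypothetical: the class, its field names, and the sample entry are assumptions made for illustration, not the format of the EVL, and it shows only how the restricted Pattern 10 licenses both positions of a movable particle.

```python
from dataclasses import dataclass, field

# Object descriptors added in step e), as glossed in the text.
OBJECT_TYPES = {
    "H": "human",
    "N": "non-human animate",
    "M": "non-animate",
    "K": "non-animate concrete",
    "I": "non-animate abstract",
}


@dataclass
class EvlEntry:
    form: str                                          # surface verb form
    pattern: str                                       # e.g. "P10"
    particles: list = field(default_factory=list)      # step a)
    object_types: list = field(default_factory=list)   # step e)


def particle_orders(entry, subject, obj):
    """Give both word orders allowed by the restricted Pattern 10."""
    if entry.pattern != "P10":
        return []
    orders = []
    for particle in entry.particles:
        orders.append(f"{subject} {entry.form} {particle} {obj}.")
        orders.append(f"{subject} {entry.form} {obj} {particle}.")
    return orders


took = EvlEntry(form="took", pattern="P10", particles=["off"], object_types=["K"])
print(particle_orders(took, "He", "his hat"))
print([OBJECT_TYPES[code] for code in took.object_types])
```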

This updating process resulted in a new EVL, which consists
of 10,431 entries. Comparison of frequency of descriptors in
the new and the original EVL is made in Table II, which follows.

TABLE II: Frequency of Patterns in New and Original EVL

     Pattern     New EVL     Original EVL

     P1            4267          4248
     P2              79            80
     P3             122           120
     P4              59            59
     P5              14            14
     P6              17            17
     P7              92            96
     P8              22            24
     P9               7             7
     P10           1269          1670
     P10B             -             1
     P11            179           181
     P12             27            26
     P13             41            42
     P14              7             8
     P15             70            61
     P16             10            10
     P17             15            14
     P17A            80            55
     P17B            16            16
     P17C             3             3
     P18              -           910
     P18A            63            17
     P18B            32            80
     P18C          1743             8
     P19              -            76
     P19A            58            16
     P19B            30             9
     P19C            14            10
     P20             76            78
     P21           2166          2074
     P22             42            45
     P22D             -             1
     P23            866          1372
     P24           1778          1121
     P24A             -             1
     P24B             -             1
     P25            139           139
     P25A             1             1
     P25B             1             1
     P29              -           312

The complete list of entries in the new EVL, subdivided as
follows, is attached to the report, Normalization of Natural
Language for Information Retrieval by Lehmann and Stachowitz.

a) Verbs which are both transitive and intransitive

     1) consisting of more than one word
     2) consisting of one word only

b) Verbs which are transitive only

     1) consisting of more than one word
     2) consisting of one word only

c) Verbs which are intransitive only

     1) consisting of more than one word
     2) consisting of one word only

d) Verbs with prepositional object or double object.



2.2.3.4 New Classification

The experience gained during this year, especially through
the acquisition of English translation equivalents for German
entries, showed that the classification scheme set up so far was
not adequate. In order to improve disambiguation, all the
complement types with which a verb may occur must be listed with
their semo-syntactic information. Therefore a new classification
scheme was developed by the German group and is described in
Section 2.3, which follows.



2.3 Development of a General Classification System

2.3.1 Purpose

As described earlier in this report, the Center's lexical
lists already contain a certain amount of syntactic and
semo-syntactic information. A general system of lexical features
was developed which will be used to add to our established German
and English noun and verb lists the information necessary for
analysis and translation and in future work on the classification
of German and English adjectives. Work on the establishment of
the necessary feature system for adverbials will be undertaken in
the coming months.

In general, two types of information are included in our
feature system:

a) the properties of the classified lexical item;
this information is shown as a value (or combination of values)
of the subscript TY (type);

b) the properties of the environment of the lexical
item; for this purpose, several subscripts and possible values
are used as described below.

Note that some semo-syntactic features occur as syntactic
features to facilitate encoding (cf. the subscript RL under
nouns, where nouns are given the feature "may take a
when-clause" rather than the feature "noun of time").

In general, we indicate features which represent surface
phenomena. If we find, upon inspection of the completed lists,
that certain features can be predicted from the occurrence of
others, they will be excluded from the dictionary and
introduced by means of redundancy rules.



2.3.2 Verb Features 



Each English or German verb will be given some or all of
the following subscripts. Certain of these are necessary for
all verb entries; these are underlined in the list below. Others
are relevant only in one of the languages we are dealing with;
these are marked by G for German and E for English.

     TY = type of verb (transitivity)
     TS = semantic type of subject
     FS = syntactic form of subject (this subscript is omitted
          if the verb allows only a noun phrase as subject)
     DS = deep subject (indicated only if the deep subject does
          not occur as a nominative in the surface sentence)
     OB = syntactic form of object(s) or complement(s)
     TO = semantic type of object
     RA = required adverbials
     OA = optional adverbials



2.3.2.1 Values for Type (TY)

     VT     = takes at least one object which is not a reflexive
              pronoun
     VTC    = takes a cognate object only; we define a cognate
              object as the true cognate and all nouns subsumed
              under that term, as e.g., to dance a waltz or a
              rain dance.
     VR     = takes an object which must be reflexive
     VT, VR = takes at least two objects, one of which must be
              reflexive and one which is not reflexive
     VI     = intransitive
     NP     = the verb does not passivize; verbs marked VI or VR
              do not need this descriptor.
     NG (E) = the verb does not form the progressive.



2.3.2.2 Values for Type of Subject (TS)

The values which may be associated with the subscript TS
are all semantic subcategories of nouns (cf. features for nouns
below). In addition, the values

     E = entia (any type of noun)
     P = plural noun only

may be used to describe the subject a verb requires.





2.3.2.3 Values for Form of Subject (FS)

     NP     = noun phrase
     IT     = it
     TH     = that-clause
     MI     = marked infinitive
     FT (E) = for-to complement
     GR (E) = gerund
     ICL    = interrogative clause
     IMI    = interrogative adverb + marked infinitive
     II (G) = interrogative adverb + unmarked infinitive




2.3.2.4 Values for Deep Subject (DS)

     G = genitive
     D = dative
     A = accusative





2.3.2.5 Values for Object or Complement Syntax (OB)

     G (G) = genitive
     D (G) = dative
     A (G) = accusative
     O (E) = noun phrase (NP) as object
     all prepositions, spelled out; German prepositions which may
             govern the dative or accusative are marked by the
             numbers 1 (for accusative) or 2 (for dative), e.g.,
             AN1, IN2, etc.
     TH, MI, etc. as defined above for FS
     CL    = main (subjunctive) clause
     PAPL  = past participle
     I     = unmarked infinitive
     BC    = takes be + NP or ADJ
     CM    = takes optional be + NP or ADJ (e.g., think)
     NC    = takes NP complement without be (e.g., elect)
     NA    = takes NP or ADJ complement without be
     AC    = takes ADJ complement without be
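
The numeric case marking on German prepositions can be decoded mechanically. The following is a minimal sketch, not part of the LRS programs; the function name is hypothetical, and it assumes a value consists of capital letters optionally followed by 1 (accusative) or 2 (dative), as in AN1 and IN2.

```python
import re

CASE_CODES = {"1": "accusative", "2": "dative"}


def decode_preposition(value):
    """Split an OB value such as 'AN1' into preposition and governed case.

    Prepositions with fixed case government carry no number; for these
    the case is returned as None.
    """
    match = re.fullmatch(r"([A-Z]+)([12])?", value)
    if match is None:
        raise ValueError(f"not a preposition value: {value!r}")
    preposition, code = match.groups()
    return preposition.lower(), CASE_CODES.get(code)


print(decode_preposition("AN1"))  # prints ('an', 'accusative')
print(decode_preposition("IN2"))  # prints ('in', 'dative')
print(decode_preposition("MIT"))  # prints ('mit', None)
```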

2.3.2.6 Values for Type of Object (TO)

These values are all noun sub-categories (cf. noun features
below), plus the values

     E   = entia (any type of noun)
     P   = plural noun only
     R   = reflexive
     RCC = reciprocal (e.g., aneinander geraten)

2.3.2.7 Values for Required Adverbials (RA)

     PLC = place (locative or directional)
     DIR = direction to
     ORN = origin (direction from)
     TIM = time (punctual or durational)
     PNC = punctual
     DUR = durational
     MAN = manner
     MSR = measure
     AC  = adjective complement (for sensory verbs, as e.g.,
           smell good)

2.3.2.8 Value for Optional Adverbials (OA)

The subscript OA is always associated with the same value:

     DOR = direction or origin (adverb of directionality)

2.3.3 Adjective Features

Adjectives are given one or more of the following subscripts
(only MD is mandatory):

     TY = type of adjective
     FM = form of adjective
     MD = modifies nouns of the specified type
     RA = requires an adverb (e.g., wohnhaft)
     OB = form of object
     TO = semantic type of object



2.3.3.1 Values for Type of Adjective (TY)

     MSR = measurable (e.g., wide or strong as in five inches
           wide, seven men strong)
     TM  = the adjective may undergo "tough movement" (e.g.,
           hard, easy)

2.3.3.2 Values for Form of Adjective

     PRPL = the adjective is in form a present participle
     PAPL = past participle

2.3.3.3 Values for Type of Noun Modified (MD)

     All sub-categories of nouns (cf. noun features below)
     TH  = that-clause
     PLU = plural, mass, or collective noun

2.3.3.4 Values for Required Adverbials (RA)

The possible values for the subscript RA are those given
for the subscript RA for verbs (cf. verb features above).

2.3.3.5 Values for Form of Object (OB)

     G (G) = genitive
     D (G) = dative
     A (G) = accusative
     All prepositions, spelled out; case government ambiguity in
             German prepositions is avoided by coding 1
             (accusative) or 2 (dative) after the preposition.

2.3.3.6 Values for Type of Object (TO)

The values for TO are all sub-categories of nouns (cf. noun
features below), and E (any type of noun).



2.3.4 Noun Features

Nouns are semantically classified and in addition have
descriptors indicating the type of attributes which they may
take. The subscripts for nouns are:

     TY = type of noun
     SX = sex
     OB = object (in case of deverbative nouns, as e.g.,
          dependence on)
     TO = semantic type of object
     TA = takes attribute
     RL = relative adverb (for deverbative nouns)
     DF = derived from
     FM = form (for nominalized adjectives)






2.3.4.1 Values for Type (TY)

PO = physical object
AB = abstract
AN = animate
PL = plant
IN = inanimate
HU = human
AL = animal
NM = proper name
CO = collective (components may be counted; can be used
     with the verb e.g., group, herd, government)
BP = body part
MS = mass (homogeneous; may occur without article in the
     singular; e.g., milk, sand)
MA = machine (since they can perform some human activities)
QU = quantity ( + (of) NP; e.g., group, glass, half,
     as in a glass of milk)
CN = count (abstract countable nouns, e.g., idea)
UN = unit (ADV = QUANT + e.g., mile, year, as in
     miles long, to wait two years)

These values may be used in combination; e.g., the English
noun government has the features TY(HU, CO, AB), indicating
both human and collective. This value system may be represented
in tree form as shown:




2.3.4.2 Values for Sex (SX)

The subscript SX has two possible values: MA (male) and
FE (female).



2.3.4.3 Values for Object (OB)

The values for the subscript OB (if relevant) are all
prepositions, spelled out, and followed by the numbers 1 or 2 to
indicate case government when the German preposition occurs with
dative or accusative: IN1, etc.

2.3.4.4 Values for Type of Object (TO)

The possible values for the subscript TO are PO, AB, etc., 
as defined under TY above. 



2.3.4.5 Values for Attributive (TA)

ZU = marked infinitive (e.g., as in attempt to do something)

CL = main clause, as in die Behauptung, dies sei die Wahrheit

TH = that-clause (non-relative that-clauses; e.g., his claim
     that this was so)

DIR = directional adverbial complement (e.g., a trip across
      Europe)



2.3.4.6 Values for Relative Adverb (RL)

WO = where (e.g., the place where I saw you)

WOHIN = whereto (e.g., the town where you went)

WARUM = why (e.g., the reason why he did it)

OB = whether (e.g., the question whether this is so)

WIE = how (e.g., die Frage, wie dies geschehen sei)

ALS = when (e.g., the time when I lived there)



2.3.4.8 Values for Form (FM)

The subscript FM may be used with only one value: A
(adjective). For example, the German noun der (or die)
Abtruennige (the renegade) is coded without inflectional ending
and with the marker FM(A):

ABTRUENNIG TY(HU) FM(A).
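
A coded entry of this kind can be read mechanically. The following is a minimal sketch in Python, not part of the Center's programs; the function name and the assumption that an entry consists of a stem followed by SUBSCRIPT(values) groups are illustrative only.

```python
import re

def parse_entry(coded: str) -> tuple[str, dict[str, list[str]]]:
    """Split a coded entry into its stem and a subscript -> values mapping."""
    # Drop the closing period, then separate the stem from the feature part.
    stem, _, rest = coded.strip().rstrip(".").strip().partition(" ")
    features = {}
    # Assumed shape: an upper-case subscript name followed by parenthesized values.
    for name, values in re.findall(r"([A-Z]+)\s*\(([^)]*)\)", rest):
        features[name] = [v for v in re.split(r"[,\s]+", values.strip()) if v]
    return stem, features

stem, features = parse_entry("ABTRUENNIG TY(HU) FM(A).")
```

Values separated by commas or blanks, as in TY(HU, CO, AB), are returned as a list under the same subscript.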
SECTION III



PROGRAMMING 



During this reporting period the programming effort was di- 
vided into three areas: grammar conversion programs, systems pro- 
grams, and supporting programs. 



3.1 Grammar Conversion

In order to make use of the existing IBM 7040 grammars and
dictionaries it was necessary to convert them to a format
suitable to the CDC 6600. The Remote File Management System (RFMS),
which was being developed to facilitate management of very large
data bases, was chosen. This system of programs allows the user
to define a data base in tree format with no restriction on the
number of branches or levels. It is based on a completely
inverted file system, and the updating and retrieval features it
allows are based on set theoretical operations.



3.1.1 Remote File Management System (RFMS F1)

The first conversion was to what will be called RFMS F1.
This was simply an intermediate conversion designed to retain
the information that was used by the IBM 7040 programs. The RFMS
F1 Data Base definition is as follows:

1] LEVEL RULE NUMBER (NAME);
3] DEGREE (NAME);
4] LEFT SIDE TERM (TEXT);
6] RIGHT SIDE TERM (RG);
   62] RIGHT SIDE SYMBOL (TEXT IN 6);
   63] B OPERATOR (NAME IN 6);
   64] S OPERATOR (NAME IN 6);
7] TYPE WEIGHT INFORMATION (RG);
   71] TYPE (NAME IN 7);
   72] WEIGHT (NAME IN 7);



The Data Base is constructed of rules whose entries each
have a component number (e.g., 3]), a name (e.g., DEGREE), and
a data type (e.g., (NAME)). (RG), "repeating group", allows the
following set of components to be repeated. In the above case,
each rule has only one left side but can have any number of terms
on the right side.
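
The shape of one RFMS F1 rule can be pictured with a small Python sketch. This is an illustration of the definition above, not the RFMS implementation; the class names and the sample rule contents (R0001, NP, DET, N) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class RightSideTerm:          # component 6, a repeating group
    symbol: str               # 62] RIGHT SIDE SYMBOL
    b_operator: str = ""      # 63] B OPERATOR
    s_operator: str = ""      # 64] S OPERATOR

@dataclass
class TypeWeight:             # component 7, a repeating group
    type: str                 # 71] TYPE
    weight: str               # 72] WEIGHT

@dataclass
class F1Rule:
    rule_number: str          # 1]
    degree: str               # 3]
    left_side: str            # 4] exactly one left side per rule
    right_side: list[RightSideTerm] = field(default_factory=list)
    type_weights: list[TypeWeight] = field(default_factory=list)

# Hypothetical rule with one left side and two right-side terms.
rule = F1Rule("R0001", "1", "NP", [RightSideTerm("DET"), RightSideTerm("N")])
```

Each repeating group becomes a list, which mirrors the statement that a rule has one left side and any number of right-side terms.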

Both the English (ENG) and German (GER) machine processable
dictionaries and their syntactic and normal-form grammars were
converted from IBM 7040 to RFMS F1. The ENG dictionary was made up
of RMD and WEBSTER, which were in different formats.

3.1.2 Remote File Management System (RFMS F2) 

To allow the writing of grammars containing rules in terms 
of complex symbols composed of subscripts, values, operators, 
macro statements, dummy statements, and choice statements, RFMS 
F2 was designed and is defined as follows: 

1] RULE NUMBER (NAME);
2] RULE TYPES (RG);
   21] RULE TYPE (NAME IN 2);
3] DEGREE (NAME);
4] MACRO (RG);
   42] M CATEGORY SYM (NAME IN 4);
   43] M SUBSCRIPT (RG IN 4);
      431] M OP 1 (NAME IN 43);
      432] M OP 2 (NAME IN 43);
      433] M LOCATOR (NAME IN 43);
      434] M SUBSCRIPT SYM (NAME IN 43);
      435] M VALUE (RG IN 43);
         4351] M BINARY OP (NAME IN 435);
         4352] M UNARY OP (NAME IN 435);
         4353] M VALUE SYM (NAME IN 435);
      436] M SLASH (NAME IN 43);
52] L CATEGORY SYM (NAME);
53] L SUBSCRIPT (RG);
   531] L OP 1 (NAME IN 53);
   532] L OP 2 (NAME IN 53);
   533] L LOCATOR (NAME IN 53);
   534] L SUBSCRIPT SYM (NAME IN 53);
   535] L VALUE (RG IN 53);
      5351] L BINARY OP (NAME IN 535);
      5352] L UNARY OP (NAME IN 535);
      5353] L VALUE SYM (NAME IN 535);
   536] L SLASH (NAME IN 53);
54] L OP (RG);
   541] L OP SYM (NAME IN 54);
   542] L OP VALUE (NAME IN 54);
55] L CHOICE (RG);
   551] L CHOICE NUMBER (NAME IN 55);
   552] L CHOICE COMMAND (NAME IN 55);
   553] L CHOICE VALUE 1 (NAME IN 55);
   554] L CHOICE VALUE (RG IN 55);
      5541] L CHOICE VALUE 2 (NAME IN 554);
6] R SIDE (RG);
   61] R CATEGORY OP (NAME IN 6);
   62] R CATEGORY SYM (NAME IN 6);
   63] R SUBSCRIPT (RG IN 6);
      631] R OP 1 (NAME IN 63);
      632] R OP 2 (NAME IN 63);
      633] R LOCATOR (NAME IN 63);
      634] R SUBSCRIPT SYM (NAME IN 63);
      635] R VALUE (RG IN 63);
         6351] R BINARY OP (NAME IN 635);
         6352] R UNARY OP (NAME IN 635);
         6353] R VALUE SYM (NAME IN 635);
      636] R SLASH (NAME IN 63);
   64] R OP (RG IN 6);
      641] R OP SYM (NAME IN 64);
      642] R OP VALUE (NAME IN 64);
   65] R CHOICE (RG IN 6);
      651] R CHOICE NUMBER (NAME IN 65);
      652] R CHOICE OP (NAME IN 65);
      653] R CHOICE SUBSCRIPT SYM (NAME IN 65);
      654] R CHOICE VALUE (RG IN 65);
         6541] R CHOICE BINARY OP (NAME IN 654);
         6542] R CHOICE UNARY OP (NAME IN 654);
         6543] R CHOICE VALUE 2 (NAME IN 654);
7] DUMMY (RG);
   72] D CATEGORY SYM (NAME IN 7);
   73] D SUBSCRIPT (RG IN 7);
      731] D OP 1 (NAME IN 73);
      732] D OP 2 (NAME IN 73);
      733] D LOCATOR (NAME IN 73);
      734] D SUBSCRIPT SYM (NAME IN 73);
      735] D VALUE (RG IN 73);
         7351] D BINARY OP (NAME IN 735);
         7352] D UNARY OP (NAME IN 735);
         7353] D VALUE SYM (NAME IN 735);
      736] D SLASH (NAME IN 73);
   74] D OP (RG IN 7);
      741] D OP SYM (NAME IN 74);
      742] D OP VALUE (NAME IN 74);
8] TYPE WEIGHT PROBABILITY (RG);
   81] TWP ASSOCIATION NUMBER (NAME IN 8);
   82] TYPE (NAME IN 8);
   83] WEIGHT (NAME IN 8);
   84] PROBABILITY (NAME IN 8);
9] TRANSFER CROSS REFERENCE (RG);
   91] TRANSFER ROLE NUMBER (NAME IN 9);



The German RFMS F1 dictionary was converted to the RFMS F2
format, and work was begun toward the conflation of the
incomplete English RMD and WEBSTER dictionaries and their ultimate
conversion to RFMS F2.

Work was also done toward the conversion of the normal-form
grammars to RFMS F2 format. As it was not possible to tell
whether the interlingual substitution symbols were constructed
of GER, ENG, or RUS (Russian) transfer names, it was necessary
to set up a complicated conversion procedure. This involved
classifying a greater part of the 160,000 interlingual
substitution symbols by hand.

When the normal-form grammars are converted, all such
symbols will be reduced to their English part, and duplicate rules
will be eliminated, resulting in much smaller normal-form
grammars.



3.2 Systems Programs

The following systems programs were designed for the dic- 
tionary phase of the translation system: 

a) grammar sort (DICT GS) 

b) tree construction (DICT TC)

c) analysis (DICT A)

d) text display for DICT A (MATRIX) 

e) choice (DICT C) 

f) workspace display for DICT A and DICT C. 

A subscript grammar program (SUB GRM) was designed for
conversion of linguistic coding format into the full RFMS F2 format.

These programs are described below in 3.4.

3.3 Supporting Programs

Supporting programs were designed to: 

a) update the working lexical lists (LIST UP), cf. 3.4;

b) produce new concordances (REQ CON), cf. 3.4;

c) collect statistical data; 

d) automate time-consuming linguistic operations; 

e) convert working lexical lists into an intermediate format
for subsequent conversion into subscript format;

f) recognize poly-word entries in dictionary rules;

g) selectively display dictionary rules according to type or 

class name; 

h) generate allomorphs for the German verb list, producing
30,000 entries from an original 17,000;

i) add class names occurring in the form prefix-stem to entries
in the German dictionary;

j) convert the German noun list to an intermediate format more
amenable to updating and conversion to subscript format,
i.e., each specific kind of information is assigned a
specific line number.

The old grammar display program was expanded to include: 

a) an analysis sort which sorts terms right-to-left, and

b) a dictionary sort with the constituents of the right-side 

terms concatenated. 



3.4 Program Descriptions

3.4.1 Dictionary Analysis (DICT A) 

Using the compiled dictionary tree constructed by DICT TC,
the dictionary analysis program (DICT A) analyzes text and
generates a workspace to be used by the dictionary choice (DICT C)
program. The compiled tree is initially loaded onto the disk in
random format and the maximum number of blocks possible is kept
in memory at all times during analysis. Statistics are kept
concerning the use of each block in memory. If a new block must be
added, the previously loaded block with the fewest accesses
is discarded.
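
The block replacement policy just described (discard the resident block with the fewest accesses) can be sketched as follows. This is an illustrative Python model, not the CDC 6600 code; the class name and the load_block callable are assumptions.

```python
class BlockCache:
    """Keeps at most `capacity` tree blocks in memory, with access counts."""

    def __init__(self, capacity, load_block):
        self.capacity = capacity
        self.load_block = load_block   # hypothetical callable: block id -> block data
        self.blocks = {}               # block id -> block data
        self.accesses = {}             # block id -> number of accesses so far

    def get(self, block_id):
        if block_id not in self.blocks:
            if len(self.blocks) >= self.capacity:
                # Discard the resident block with the fewest accesses.
                victim = min(self.blocks, key=lambda b: self.accesses[b])
                del self.blocks[victim]
                del self.accesses[victim]
            self.blocks[block_id] = self.load_block(block_id)
            self.accesses[block_id] = 0
        self.accesses[block_id] += 1
        return self.blocks[block_id]

cache = BlockCache(2, lambda b: f"block-{b}")
for wanted in (1, 1, 2, 3):
    cache.get(wanted)
# Block 2 (one access) was discarded to make room for block 3.
```

How ties between equally used blocks were resolved is not stated in the report; the sketch simply takes the first such block.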

DICT A has two input parameters, the K-option indicator and
the display indicator. If the K-option is on, the rules
interpreting endings are applied everywhere except after a punctuation
mark or a space. If the K-option is off, these rules are applied
only after a morpheme boundary. The display indicator selects
the sort option for the display of the resulting workspace. These
options are from-to sorts, to-from sorts, or both.

For each file entry, the display contains the rule which
applied and its number, the items "FROM" (text position where the
entry begins), "TO" (text position where the entry ends), and a
condition code for the application of rules interpreting the
immediate right context.

The analysis program creates a table containing entries of 
text character sequences which match the compiled tree. Each 
table entry contains three items of information concerning the 
sequence: the location of the node in the tree, the starting 
character (or file), and the number of characters at this point. 

The text consists of N characters (numbered from 1 to N).
For each character position I, an associated file I is created
which contains entries whose terminal strings end at position I.
Entries for file I are referred to as FEI.1, FEI.2, etc.
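
The organization of file entries by end position can be illustrated with a simplified Python sketch. It matches a plain word list by brute force rather than walking the compiled tree, and it ignores the B/E restrictions; the function name, the sample text, and the rule numbers are hypothetical.

```python
def build_files(text, dictionary):
    """Return {TO position: [(rule, FROM, TO), ...]} with 1-based positions."""
    files = {i: [] for i in range(1, len(text) + 1)}
    for start in range(len(text)):
        for form, rule in dictionary.items():
            if text.startswith(form, start):
                end = start + len(form)      # 1-based position of the last character
                files[end].append((rule, start + 1, end))
    return files

# Hypothetical dictionary: terminal string -> rule number.
files = build_files("UNDOING", {"UN": 1, "DO": 2, "ING": 3, "UNDO": 4})
```

Here file 4 receives two entries, one for UNDO (positions 1 to 4) and one for DO (positions 3 to 4), which is the kind of ambiguity the choice program later resolves.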

Every sequence of characters defining a terminal in the tree
has as its second character a B, E, or blank represented by ^
(see DICT GS). Thus at this point the node will be either a B,
E, or *. If more than one character occurs at this point in the
tree, these characters will be linked together by down pointers
indicating branches in the tree.

For each text character processed, a new table entry is con- 
structed if that character may begin a sequence. 

Each table entry already constructed is processed as follows: 

a) a new file entry is constructed if a sequence ended in 

the last file; 

b) the table entry is updated by the new node position and 

the character count is either destroyed or incremented 

according to whether — 

(1) the sequence does or does not continue as part of 

another rule, and 

(2) the second character of the sequence is or is not 

being processed; 

c) the starting branch conditions are evaluated (as opposed 

to character matches being performed as in the cases 

above), if — 

(1) a sequence continues as part of another rule, and 

(2) the second character of the sequence is being 

processed . 

If the second character of the sequence being processed is: 
B, the string may not begin if the previous file — 

(1) does not contain an interpreted string, 

(2) contains a punctuation mark, or 

(3) contains a blank; 

E, the string may begin; 

*, the string may not begin if the previous file does not contain 

an interpreted string. 

During the processing of the second character, the first 
reference to the table entry modifies the entry. All future 
references create new table entries. After all table entries for 
the character are processed, the table is resorted to put the 
longest sequence first, but only if there were any multiple
second-character table-entry constructions.

As each new file entry is constructed, the left-side operators
M, - 1 , and ® are used to compute the value for the FROM file.
This value will be used by the following file to determine
whether the new file may be constructed. If the second character
is a P and the value of the previous file indicates a blank or
punctuation mark, a new file entry is completed.
3.4.2 Dictionary Choice (DICT C)

DICT C processes the from-to workspace output from DICT A.
It discards all file entries from the workspace which do not
belong to a sequence of rules which span M-symbols. (The M-symbols
are, primarily, blanks or punctuation marks, including hyphens.)
It also generates K-rules for all M-symbol sequences which are
not spanned. It has four input parameters. The first sets the
K-option either on or off (cf. 3.4.1). The second sets the
preference (P-) option either on or off. The third records which
workspace display is requested for output, the options being any
choice or combination of: to-from, from-to, or all deleted file
entries. The fourth parameter indicates whether the from-to
workspace should be saved or destroyed. (Word analysis uses
workspace in the to-from format.)

DICT C reads in file entries until it finds a group which
completely spans two M-symbols. It processes this group and then
reads in the next group.

The first operation performed is the elimination of all file
entries from this group which have right-side F-operators and are
not followed by an M-symbol. An F-operator is assigned to all
rules for which only punctuation or ** can follow.

If the P-option is on, then for any file entry having a
left-side P-operator, all other sequences or file entries covering
the same span are discarded. The P-operator in a rule gives
preference to a long span over two or more short spans.

The rules used in all possible sequences covering the span
are tagged for later processing. Processing for this span is
terminated when a possible sequence is found without M-symbols
resulting from a rule with a multi-word right-side. If a possible
sequence with an internal M-symbol is found, all possible
sequences are calculated for each subspan. If a subspan is not
completely covered, a K-rule is generated. When the K-option is
on, additional K-rules are constructed which link together all
possibilities for prefixes and suffixes.

If the original span was not covered, a K-rule is generated
to cover it. Additional K-rules are also generated for sequences
of the form: prefix-K, prefix-K-suffix, and K-suffix; and each of
these sequences covers the original span.
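
The core of the choice step, finding every sequence of file entries that covers a span and falling back to a K-rule when none does, can be sketched in Python. This is a simplified illustration only: it omits the F- and P-operators, subspans, and the prefix and suffix K-rules, and the ("K", start, end) marker is a hypothetical representation of a K-rule.

```python
def covering_sequences(entries, start, end):
    """All chains of (rule, frm, to) entries that tile positions start..end."""
    if start > end:
        return [[]]                      # nothing left to cover
    chains = []
    for entry in entries:
        rule, frm, to = entry
        if frm == start and to <= end:
            for tail in covering_sequences(entries, to + 1, end):
                chains.append([entry] + tail)
    return chains

def choose(entries, start, end):
    """Return the covering sequences, or a single K-rule if the span is uncovered."""
    chains = covering_sequences(entries, start, end)
    return chains or [[("K", start, end)]]

# Hypothetical file entries: (rule number, FROM, TO).
entries = [(1, 1, 2), (4, 1, 4), (2, 3, 4), (3, 5, 7)]
covered = choose(entries, 1, 7)
uncovered = choose(entries, 1, 3)
```

With these entries the span 1 to 7 has two coverings, while the span 1 to 3 has none and receives a K-rule.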

3.4.3 Dictionary Grammar Sort (DICT GS)

DICT GS has two major functions: a) to determine the
restrictions on the application of a particular rule, and b) to
sort the dictionary grammar according to the right-side term(s)
and the application restriction information (ARI).

The dictionary contains sixty-four roots. From each root
four branches may theoretically extend which represent the
restrictions for all terminals. These branches are the
[P]-restriction, the [®]-restriction, the [B]-restriction, and the
[E]-restriction. The [P]-restriction indicates that the rule may
apply to a string which is preceded by a punctuation mark or blank.
Both the [®]-restriction and the [B]-restriction indicate that
the rule may apply to a string which is contiguous to a preceding
interpreted string. The [B]-restriction also indicates that the
span must not be preceded by a punctuation mark or blank. The
[E]-restriction indicates that the rule may apply anywhere; there
are no restrictions in this case.

To construct a grammar tree, the ARI of the rules needs to
be retained. Therefore, depending upon the ARI in the rule, the
program DICT GS inserts a "B", "E", or "^". (The [P]-restriction
is included under the ^-indicator at this point. In the surface
dictionary analysis, a distinction is made.)

DICT GS strips RFMS loader-format repeating-group names and
extraneous information from the rule. The program generates sort
keys, consisting of the right-side terms and the ARI, and retains
the left-side terms and ARI as data. The ARI indicator is the
second character in the sort key.

Each rule in the dictionary grammar is converted to the
following form in DICT GS:

Word 1: Length of rule (revised for SORT/MERGE
routine), M, and length of sort key, N;

Words 2 → (N+1): Sort key;

Words (N+2) → M: Sort data area.

The program then sorts the rules in the dictionary grammar,
which are in the form listed above.
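
The placement of the ARI indicator as the second character of the sort key can be shown with a short Python sketch. The rule contents are hypothetical, and Python's string ordering stands in for the collating sequence of the SORT/MERGE routine, which the report does not specify.

```python
def sort_key(right_side, ari):
    """Build a sort key with the one-character ARI indicator in second place."""
    return right_side[0] + ari + right_side[1:]

# Hypothetical rules: (right-side term, ARI indicator, data area).
rules = [
    ("ING", "B", "rule 3"),
    ("DO", "E", "rule 2"),
    ("DO", "B", "rule 5"),
]
records = sorted((sort_key(rs, ari), data) for rs, ari, data in rules)
```

Because the indicator follows the first character, rules sharing a first character are grouped by restriction before the rest of the term is compared, which matches the tree layout in which the second node is the B, E, or blank indicator.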

Finally, DICT GS creates a new tape consisting of two re- 
cords. The first record contains information concerning the 
length of the longest sort key created, the length of the longest 
data area created, and the date the new tape was created. The 
second record contains the sorted dictionary grammar. This new 
file is used as input to the dictionary tree construction program. 

3.4.4 Dictionary Tree Construction (DICT TC)

DICT TC builds the compiled dictionary tree and its index
from the output of DICT GS. It reads in one entry at a time,
comparing it character by character with the previous entry.
Where the character strings differ, a down pointer is attached to
the previous string to indicate the place where the new string
continues. If the old string is a subset of the new string, a
continuation (or right) pointer is attached to the end of the old
string. In both cases, after all the characters are placed in
the tree, the remaining information (e.g., the rule number and
the left-side of the rule) is added at the end of the string.
Another new entry is read in and the process is repeated. Each
time a new first character is encountered, a pointer is placed in
the index table. Thus, after the process is completed, there is
a pointer to the beginning of every character tree. The index
and the compiled tree are then written out in a form suitable for
use by DICT A.
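
A tree with right (continuation) and down (alternative) pointers and a first-character index can be sketched in Python as follows. This is an illustration of the data structure only: unlike DICT TC it does not rely on sorted input or on comparison with the previous entry, and the node layout and sample keys are assumptions.

```python
class Node:
    def __init__(self, char):
        self.char = char
        self.right = None   # continuation: next character of the same string
        self.down = None    # alternative character at the same position
        self.data = None    # rule information stored where a string ends

def build_tree(entries):
    """entries: (key, data) pairs; returns the index {first character: root node}."""
    index = {}
    for key, data in entries:
        node = index.get(key[0])
        if node is None:
            node = index[key[0]] = Node(key[0])
        for char in key[1:]:
            if node.right is None:
                node.right = Node(char)
                node = node.right
                continue
            node = node.right
            while node.char != char:        # follow or extend the down chain
                if node.down is None:
                    node.down = Node(char)
                node = node.down
        node.data = data
    return index

def lookup(index, key):
    node = index.get(key[0])
    for char in key[1:]:
        node = node.right if node else None
        while node is not None and node.char != char:
            node = node.down
    return node.data if node else None

index = build_tree([("DBO", "rule 5"), ("DEO", "rule 2"), ("IBNG", "rule 3")])
```

The keys DBO and DEO share the root D; the E node hangs from the B node by a down pointer, and each carries its own right pointer to the following character.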



3.4.5 Subscript Grammar (SUB GRM)

SUB GRM converts subscript rules from the form in which the 
linguists encode them into RFMS F2 Loader Input format. Rule 
numbers and duplication numbers are optional input. All rules 
containing format errors are discarded. 



3.4.6 List Update (LIST UP)

LIST UP updates all the working lexical lists. These are in
the form of card images, each of which is indexed by corpus,
request, and line numbers. LIST UP allows additions, deletions,
insertions, and replacements on a card-for-card basis. The
output consists of all requests plus all changes, or only those
requests for which a change was made, and a new updated tape.
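
The card-for-card update can be pictured with a small Python sketch in which each card image is keyed by its corpus, request, and line numbers. The operation names, the key layout, and the sample cards are hypothetical; the sketch treats an insertion as an addition under a new line number.

```python
def list_update(cards, changes):
    """cards: {(corpus, request, line): text}; changes: (op, key, text) tuples."""
    updated = dict(cards)
    changed_requests = set()
    for op, key, text in changes:
        if op == "delete":
            updated.pop(key, None)
        elif op in ("add", "replace"):
            updated[key] = text
        else:
            raise ValueError(f"unknown operation: {op}")
        changed_requests.add(key[:2])       # (corpus, request) that was touched
    return dict(sorted(updated.items())), changed_requests

# Hypothetical card images.
cards = {("GER", 1, 10): "HAUS", ("GER", 1, 20): "old", ("GER", 2, 10): "BAUM"}
changes = [
    ("replace", ("GER", 1, 20), "new"),
    ("add", ("GER", 1, 15), "inserted"),
    ("delete", ("GER", 2, 10), None),
]
new_cards, changed = list_update(cards, changes)
```

The set of changed requests corresponds to the second output option, in which only requests with a change are listed.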



3.4.7 Concordance Program (REQ CON)

A new concordance program was constructed having the follow- 
ing features: 

a) A display of the concorded word in the context of
the entire request (identified by the digits in columns 4-7)

b) Forward and/or backward sorts. Each sort includes 
all the words in the request. In the forward sort the re- 
quest, in the following succession, is used to determine the 
list order for the concorded word — 

1 ) concorded word 

2) words in sequence to the right of the concor- 

ded word 
3) a "zero word", inserted at the end of the
request, which takes precedence over any other
word at the same point

4) words in sequence to the left of the concorded 

word. 

A backward sort takes the words to the left first, and then 
the words to the right, inserting the "zero word" at the 
beginning of the request. 

c) An inclusion/exclusion option 

d) A glossary of all concorded words with their
frequencies

e) A choice of no display or any of three forms of
output display: all the requests, only those requests which
were used, or the requests not used

f) Standard or non-standard procedure for concording 
words. Standard is based on the occurrence of the word 
itself; non-standard refers to words preceded by a special 
character. For the latter, pre-processing programs for 
tagging the words to be concorded may be required. 

g) Concordance restrictable to specified sequences of
starting characters. This capability permits the recovery
of information when the capacity of the computer is
exceeded.
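
The forward and backward sort orders described under b) can be expressed as sort keys. The following Python sketch is an interpretation, not the REQ CON code: it assumes that the words on the far side of the "zero word" are taken nearest first, and it represents the zero word by an empty string so that it precedes any other word at the same point.

```python
ZERO = ""   # stands for the "zero word"; sorts before every non-empty word

def forward_key(words, i):
    """Key for the word at index i: the word, words to its right, zero word, words to its left."""
    return [words[i]] + words[i + 1:] + [ZERO] + words[:i][::-1]

def backward_key(words, i):
    """Key for the word at index i: the word, words to its left, zero word, words to its right."""
    return [words[i]] + words[:i][::-1] + [ZERO] + words[i + 1:]

# Two hypothetical requests, both concorded on the word at index 1.
short = ["THE", "RED", "HOUSE"]
longer = ["THE", "RED", "HOUSE", "BURNS"]
ordered = sorted([longer, short], key=lambda request: forward_key(request, 1))
```

In the forward sort the shorter request comes first, because its zero word is compared with BURNS at the point where the shorter request ends.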
CONCLUSION



Progress under the contract has been good, in spite of
reduced funding. The theory underlying the Linguistics Research
System has been developed. The linguistic descriptions which
are necessary to implement this theory, however, have not met
our original projections, because of lack of manpower.
Programming has also suffered from the reduction in funding.



During the remainder of the contract period a lexicon will
be produced which will have "precise information on the syntactic
and semantic properties of lexical items". Preliminary grammars
as required for the implementation of the Linguistics Research
System will be produced.



Much of the programming effort has been concerned with
bringing our linguistic data into the formats required by the
Linguistics Research System, and with updating the Center's
lexical data bases. In the last two years of work under the contract,
programs will be constructed for handling the grammars described
in Section I of this report and the German and English lexical
data.